Bayesian Hierarchical Clustering for Studying Cancer Gene Expression Data with Unknown Statistics
نویسندگان
چکیده
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
منابع مشابه
به کارگیری روشهای خوشهبندی در ریزآرایه DNA
Background: Microarray DNA technology has paved the way for investigators to expressed thousands of genes in a short time. Analysis of this big amount of raw data includes normalization, clustering and classification. The present study surveys the application of clustering technique in microarray DNA analysis. Materials and methods: We analyzed data of Van’t Veer et al study dealing with BRCA1...
متن کاملModification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
متن کاملA Quantitative Study of Gene Regulation Involved in the Immune Response of Anopheline Mosquitoes: An Application of Bayesian Hierarchical Clustering of Curves
Malaria represents one of the major worldwide challenges to public health. A recent breakthrough in the study of the disease follows the annotation of the genome of the malaria parasite Plasmodium falciparum and the mosquito vector 1 Anopheles. Of particular interest is the molecular biology underlying the immune response system of Anopheles which actively fights against Plasmodium infection. T...
متن کاملBayesian binary kernel probit model for microarray based cancer classification and gene selection
With the arrival of gene expression microarrays a new challenge has opened up for identification or classification of cancer tissues. Due to the large number of genes providing valuable information simultaneously compared to very few available tissue samples the cancer staging or classification becomes very tricky. In this paper we introduce a hierarchical Bayesian probit model for two class ca...
متن کاملA modified correlation coefficient based similarity measure for clustering time-course gene expression data
Gene expression levels are often measured consecutively in time through microarray experiments to detect cellular processes underlying regulatory effects observed and to assign functionality to genes whose function is yet unknown. Clustering methods allow us to group genes that show similar time-course expression profiles and that are thus likely to be co-regulated. The correlation coefficient,...
متن کامل